{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "a19sSgshu-SA",
    "papermill": {
     "duration": 0.024906,
     "end_time": "2021-07-23T16:17:55.704014",
     "exception": false,
     "start_time": "2021-07-23T16:17:55.679108",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "# Double/Debiased Machine Learning for the Partially Linear Regression Model\n",
    "\n",
    "This is a simple implementation of Debiased Machine Learning for the Partially Linear Regression Model, which provides an application of DML inference to determine the causal effect of countries' intitial wealth on the rate of economic growth.\n",
    "\n",
    "\n",
    "Reference:\n",
    "\n",
    "- https://arxiv.org/abs/1608.00060\n",
    "- https://www.amazon.com/Business-Data-Science-Combining-Accelerate/dp/1260452778\n",
    "\n",
    "The code is based on the book."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "x4283h0kwo_M"
   },
   "outputs": [],
   "source": [
    "# Import relevant packages\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.model_selection import cross_val_predict, KFold\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "import warnings\n",
    "from sklearn.base import BaseEstimator\n",
    "import statsmodels.formula.api as smf\n",
    "warnings.simplefilter('ignore')\n",
    "\n",
    "np.random.seed(123)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "g6JCZCdqvj1Z"
   },
   "outputs": [],
   "source": [
    "file = \"https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/GrowthData.csv\"\n",
    "data = pd.read_csv(file)\n",
    "data = data.loc[:, ~data.columns.str.contains('^Unnamed')]  # get rid of index column\n",
    "data.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "5xhpQNhX9GUk"
   },
   "outputs": [],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "xS0HvaqY9Ff7"
   },
   "source": [
    "Construct the outcome variable, the treatment variable and the matrix $x$ that includes the control variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "iqL4LxYJ9ICj"
   },
   "outputs": [],
   "source": [
    "y = data[\"Outcome\"]\n",
    "d = data[\"gdpsh465\"]\n",
    "x = data[data.columns[~data.columns.isin(['gdpsh465', 'intercept', 'Outcome'])]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "CP0rMYIX9ej-"
   },
   "outputs": [],
   "source": [
    "# some summary statistics\n",
    "print(\"The length of y is: \", y.shape[0])\n",
    "print(\"The number of features in x is: \", x.shape[1])\n",
    "# naive OLS\n",
    "all_columns = \"+\".join(data.iloc[:, 2:].columns)\n",
    "my_formula = \"Outcome ~ \" + all_columns\n",
    "ols_naive = smf.ols(formula=my_formula, data=data).fit()\n",
    "print(\"Naive OLS that uses all features w/o cross-fitting Y ~ D+X yields: \",\n",
    "      ols_naive.params[1], \"(\", ols_naive.bse[1], \")\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ubu-QI2Ju-SB",
    "papermill": {
     "duration": 0.024711,
     "end_time": "2021-07-23T16:17:55.803109",
     "exception": false,
     "start_time": "2021-07-23T16:17:55.778398",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "## DML algorithm\n",
    "\n",
    "Here we perform estimation and inference of predictive coefficient $\\alpha$ in the partially linear statistical model,\n",
    "$$\n",
    "Y = D\\alpha + g(X) + U, \\quad E (U | D, X) = 0.\n",
    "$$\n",
    "For $\\tilde Y = Y- E(Y|X)$ and $\\tilde D= D- E(D|X)$, we can write\n",
    "$$\n",
    "\\tilde Y = \\alpha \\tilde D + U, \\quad E (U |\\tilde D) =0.\n",
    "$$\n",
    "Parameter $\\alpha$ is then estimated using cross-fitting approach to obtain the residuals $\\tilde D$ and $\\tilde Y$.\n",
    "The algorithm comsumes $Y, D, X$, and machine learning methods for learning the residuals $\\tilde Y$ and $\\tilde D$, where\n",
    "the residuals are obtained by cross-validation (cross-fitting).\n",
    "\n",
    "The statistical parameter $\\alpha$ has a causal interpretation of being the effect of $D$ on $Y$ in the causal DAG $$ D\\to Y, \\quad X\\to (D,Y)$$ or the counterfactual outcome model with conditionally exogenous (conditionally random) assignment of treatment $D$ given $X$:\n",
    "$$\n",
    "Y(d) = d\\alpha + g(X) + U(d),\\quad  U(d) \\text{ indep } D |X, \\quad Y = Y(D), \\quad U = U(D).\n",
    "$$\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "19iFcY8o_AnT"
   },
   "outputs": [],
   "source": [
    "def dml(X, D, y, modely, modeld, *, nfolds, classifier=False, time=None, clu=None, cluster=True):\n",
    "    '''\n",
    "    DML for the Partially Linear Model setting with cross-fitting\n",
    "\n",
    "    Input\n",
    "    -----\n",
    "    X: the controls\n",
    "    D: the treatment\n",
    "    y: the outcome\n",
    "    modely: the ML model for predicting the outcome y\n",
    "    modeld: the ML model for predicting the treatment D\n",
    "    nfolds: the number of folds in cross-fitting\n",
    "    classifier: bool, whether the modeld is a classifier or a regressor\n",
    "\n",
    "    time: array of time indices, eg [0,1,...,T-1,0,1,...,T-1,...,0,1,...,T-1]\n",
    "    clu: array of cluster indices, eg [1073, 1073, 1073, ..., 5055, 5055, 5055, 5055]\n",
    "    cluster: bool, whether to use clustered standard errors\n",
    "\n",
    "    Output\n",
    "    ------\n",
    "    point: the point estimate of the treatment effect of D on y\n",
    "    stderr: the standard error of the treatment effect\n",
    "    yhat: the cross-fitted predictions for the outcome y\n",
    "    Dhat: the cross-fitted predictions for the treatment D\n",
    "    resy: the outcome residuals\n",
    "    resD: the treatment residuals\n",
    "    epsilon: the final residual-on-residual OLS regression residual\n",
    "    '''\n",
    "    cv = KFold(n_splits=nfolds, shuffle=True, random_state=123)  # shuffled k-folds\n",
    "    yhat = cross_val_predict(modely, X, y, cv=cv, n_jobs=-1)  # out-of-fold predictions for y\n",
    "    # out-of-fold predictions for D\n",
    "    # use predict or predict_proba dependent on classifier or regressor for D\n",
    "    if classifier:\n",
    "        Dhat = cross_val_predict(modeld, X, D, cv=cv, method='predict_proba', n_jobs=-1)[:, 1]\n",
    "    else:\n",
    "        Dhat = cross_val_predict(modeld, X, D, cv=cv, n_jobs=-1)\n",
    "    # calculate outcome and treatment residuals\n",
    "    resy = y - yhat\n",
    "    resD = D - Dhat\n",
    "\n",
    "    if cluster:\n",
    "        # final stage ols clustered\n",
    "        dml_data = pd.concat([clu, pd.Series(time),\n",
    "                              pd.Series(resy, name='resy'),\n",
    "                              pd.Series(resD, name='resD')], axis=1)\n",
    "    else:\n",
    "        # final stage ols nonclustered\n",
    "        dml_data = pd.concat([pd.Series(resy, name='resy'),\n",
    "                              pd.Series(resD, name='resD')], axis=1)\n",
    "\n",
    "    if cluster:\n",
    "        # clustered standard errors\n",
    "        ols_mod = smf.ols(formula='resy ~ 1 + resD', data=dml_data)\n",
    "        ols_mod = ols_mod.fit(cov_type='cluster', cov_kwds={\"groups\": dml_data['CountyCode']})\n",
    "    else:\n",
    "        # regular ols\n",
    "        ols_mod = smf.ols(formula='resy ~ 1 + resD', data=dml_data).fit()\n",
    "\n",
    "    point = ols_mod.params[1]\n",
    "    stderr = ols_mod.bse[1]\n",
    "    epsilon = ols_mod.resid\n",
    "\n",
    "    return point, stderr, yhat, Dhat, resy, resD, epsilon"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "gZ7hjIAn_YKR"
   },
   "outputs": [],
   "source": [
    "def summary(point, stderr, yhat, Dhat, resy, resD, epsilon, X, D, y, *, name):\n",
    "    '''\n",
    "    Convenience summary function that takes the results of the DML function\n",
    "    and summarizes several estimation quantities and performance metrics.\n",
    "    '''\n",
    "    return pd.DataFrame({'estimate': point,  # point estimate\n",
    "                         'stderr': stderr,  # standard error\n",
    "                         'rmse y': np.sqrt(np.mean(resy**2)),  # RMSE of model that predicts outcome y\n",
    "                         'rmse D': np.sqrt(np.mean(resD**2))  # RMSE of model that predicts treatment D\n",
    "                         }, index=[name])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "FZp5UE3cCWkE"
   },
   "source": [
    "We now run through DML using as first stage models:\n",
    " 1. OLS\n",
    " 2. (Rigorous) Lasso\n",
    " 3. Random Forests\n",
    " 4. Mix of Random Forest and Lasso"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8ofBdw_eB72p"
   },
   "source": [
    "Run the following command to install hdmpy for rigorous lasso:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "P1Np-LG_Z_EA"
   },
   "outputs": [],
   "source": [
    "!pip install multiprocess\n",
    "\n",
    "\n",
    "!git clone https://github.com/maxhuppertz/hdmpy.git"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "aaX6Cs5YCc2e"
   },
   "outputs": [],
   "source": [
    "import hdmpy\n",
    "\n",
    "\n",
    "class RLasso(BaseEstimator):\n",
    "\n",
    "    def __init__(self, *, post=True):\n",
    "        self.post = post\n",
    "\n",
    "    def fit(self, X, y):\n",
    "        self.rlasso_ = hdmpy.rlasso(X, y, post=self.post)\n",
    "        return self\n",
    "\n",
    "    def predict(self, X):\n",
    "        return np.array(X) @ np.array(self.rlasso_.est['beta']).flatten() + np.array(self.rlasso_.est['intercept'])\n",
    "\n",
    "\n",
    "def lasso_model():\n",
    "    return RLasso(post=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "ql64rlMt_I9V"
   },
   "outputs": [],
   "source": [
    "# DML with OLS:\n",
    "modely = make_pipeline(StandardScaler(), LinearRegression())\n",
    "modeld = make_pipeline(StandardScaler(), LinearRegression())\n",
    "result_OLS = dml(x, d, y, modely, modeld, nfolds=10, classifier=False, cluster=False)\n",
    "table_OLS = summary(*result_OLS, x, d, y, name='OLS')\n",
    "\n",
    "# DML with RLasso:\n",
    "modely = make_pipeline(StandardScaler(), RLasso(post=False))\n",
    "modeld = make_pipeline(StandardScaler(), RLasso(post=False))\n",
    "result_RLasso = dml(x, d, y, modely, modeld, nfolds=10, classifier=False, cluster=False)\n",
    "table_RLasso = summary(*result_RLasso, x, d, y, name='Lasso')\n",
    "\n",
    "\n",
    "# DML with Random Forests\n",
    "modely = make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=123))\n",
    "modeld = make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=123))\n",
    "result_RF = dml(x, d, y, modely, modeld, nfolds=10, classifier=False, cluster=False)\n",
    "table_RF = summary(*result_RF, x, d, y, name='RF')\n",
    "\n",
    "# DML with Mix:\n",
    "modely = make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=123))\n",
    "modeld = make_pipeline(StandardScaler(), RLasso(post=False))\n",
    "result_mix = dml(x, d, y, modely, modeld, nfolds=10, classifier=False, cluster=False)\n",
    "table_mix = summary(*result_mix, x, d, y, name='RF/Lasso Mix')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "oJcUH1gcBt9U"
   },
   "outputs": [],
   "source": [
    "table = pd.concat([table_OLS, table_RLasso, table_RF, table_mix], axis=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "YZ0RFqV3DMJ7"
   },
   "outputs": [],
   "source": [
    "print(table)"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  },
  "papermill": {
   "default_parameters": {},
   "duration": 663.673787,
   "end_time": "2021-07-23T16:28:56.086168",
   "environment_variables": {},
   "exception": null,
   "input_path": "__notebook__.ipynb",
   "output_path": "__notebook__.ipynb",
   "parameters": {},
   "start_time": "2021-07-23T16:17:52.412381",
   "version": "2.2.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
